Explorations into Unsupervised Corpus Quality Assessment
Abstract
Corpora, large bodies of text, are of great importance to the field of Natural Language Processing. They are used, for instance, to train systems on specific tasks through machine learning. A system trained on a low-quality corpus will not provide reliable results. We search for a metric of corpus quality by comparing the vocabulary growth data, the Zipf slopes and the Pareto exponents of corpora of different sizes, compositions and qualities in Dutch and English. Vocabulary growth curves are a useful tool for directly spotting major deviations in a text. Our method applies a linear regression to these curves, which unfortunately does not capture the small errors that contaminate a corpus, such as spelling mistakes. The Zipf slope and Pareto exponent do show explicable deviations when the quality of a corpus changes. However, they also deviate from their standard values of 1 and 2 respectively, which are reported in much of the literature, when the corpus size changes. Consequently, we did not find a solid metric of corpus quality, but a crude measure can be derived from our results.
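The two simplest quantities named in the abstract, the vocabulary growth curve and the Zipf slope, can be illustrated with a minimal Python sketch. The abstract does not describe the tokenization or the fitting procedure, so everything here is an assumption: the letter-based tokenizer, the function names, and the choice of an ordinary least-squares fit of log frequency against log rank for the slope.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def vocabulary_growth(tokens, step=1):
    """Return (tokens_seen, distinct_types) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def zipf_slope(tokens):
    """Magnitude of the slope of log(frequency) against log(rank).

    Needs at least two distinct word types; the fitting method is an
    assumption, not necessarily the one used in the paper.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_rank = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_freq = [math.log(f) for f in freqs]
    return -least_squares_slope(log_rank, log_freq)
```

On a real corpus one would call `tokenize` on the raw text, plot the output of `vocabulary_growth` to look for abrupt changes, and compare `zipf_slope` across corpora of similar size, since the abstract reports that the slope also varies with corpus size.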
Similar resources
Automatic Selection of High Quality Parses Created By a Fully Unsupervised Parser
The average results obtained by unsupervised statistical parsers have greatly improved in the last few years, but on many specific sentences they are of rather low quality. The output of such parsers is becoming valuable for various applications, and it is radically less expensive to create than manually annotated training data. Hence, automatic selection of high quality parses created by unsup...
Assessing Quality of Unsupervised Topics in Song Lyrics
How useful are topic models based on song lyrics for applications in music information retrieval? Unsupervised topic models on text corpora are often difficult to interpret. Based on a large collection of lyrics, we investigate how well automatically generated topics are related to manual topic annotations. We propose to use the kurtosis metric to align unsupervised topics with a reference mode...
Disentangling from Babylonian Confusion - Unsupervised Language Identification
This work presents an unsupervised solution to language identification. The method sorts multilingual text corpora on the basis of sentences into the different languages that are contained and makes no assumptions on the number or size of the monolingual fractions. Evaluation on 7-lingual corpora and bilingual corpora show that the quality of classification is comparable to supervised approache...
Quality Assessment of English-into-Persian Translations of Tourism Management Academic Textbooks
This paper addresses the quality of the Persian translations of 32 English tourism textbooks. The quality was assessed at sentence-level and page-level by the researchers and from the viewpoint of a tourism management student. In Phase 1, the quality of one randomly selected sentence from each textbook was assessed applying Hurtado Albir's analytical model; two were acc...
On the Assessment of Text Corpora
Classifier-independent measures are important to assess the quality of corpora. In this paper we present supervised and unsupervised measures in order to analyse several data collections for studying the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may allow to evaluate the quality of gold standards. Moreove...
Publication date: 2008